Pruning of trees

The term 'Noise' when applied to data files can mean a number of things:

·The data fields are corrupted due to typing errors or transmission errors.

·The contents of the data fields contain inaccurate information as a result of human errors in making decisions, or machine errors in logging events.

·The data source does not contain enough data to cover the patterns we are trying to discover.

·The data source does not have sufficient attribute fields to classify the outcome field.

·The data source contains attribute fields which are irrelevant to the classification of the outcome field.

·The patterns which are contained in the data source were changing throughout the time span over which the data was collected.

A tree induced from noisy data will, on the whole, maintain its accuracy near the root but manifest the effects of noise at the leaves. The effects of noise on the leaves of a tree are:

·Branches near the leaves with small numbers of filtered data records. These branches do not represent coherent patterns but are an attempt by the tree to fit the noise.

·Branches near the leaves which are caused by the presence of irrelevant attributes in the absence of relevant attributes.

Because the effects of noise are localised near the leaves, they can be removed by pruning the tree, that is, by stopping branching according to a given criterion. Pruning simplifies the tree, to the extent that it no longer classifies all the records of the development data correctly. Each leaf therefore carries a probability figure, giving an estimate of the accuracy of the leaf in classifying the development data set (and hence other 'unseen' data sets). Pruning of the tree is controlled by the Minimum Examples in a branch parameter and by the F-test cut-off or Chi-square significance level, all of which are defined before inducing the tree. This is called forward pruning, since the growth of the tree is controlled during the induction process. A value of 1% is recommended for the F-test percentile.
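The forward-pruning check and the leaf probability figure described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation. It assumes a binary split on a two-class outcome, so that the Chi-square test has one degree of freedom. The function names, the default of 5 minimum examples, and the use of 0.01 as the Chi-square level (borrowed from the 1% figure recommended for the F-test) are all assumptions made for the example.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts.

    table[i][j] is the number of records in branch i with outcome class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    if n == 0:
        return 0.0, 1.0  # no records: no evidence of association
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                return 0.0, 1.0  # degenerate table: treat as not significant
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X >= stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def should_split(table, min_examples=5, significance=0.01):
    """Forward-pruning check for one candidate binary split.

    The split is kept only if every branch holds at least `min_examples`
    records and the chi-square test is significant at the given level.
    Both default values are assumptions chosen for illustration.
    """
    if any(sum(row) < min_examples for row in table):
        return False
    _, p_value = chi_square_2x2(table)
    return p_value < significance


def leaf_probability(class_counts):
    """Fraction of development records at a leaf that fall in its majority class."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("leaf has no records")
    return max(class_counts) / total


if __name__ == "__main__":
    strong = [[40, 10], [10, 40]]  # branches separate the classes well
    weak = [[26, 24], [24, 26]]    # branches barely differ: likely noise
    tiny = [[3, 0], [47, 50]]      # first branch has too few records
    for name, table in (("strong", strong), ("weak", weak), ("tiny", tiny)):
        print(name, should_split(table))
    print(leaf_probability([47, 3]))
```

In this sketch a split that only fits noise is rejected either because one branch filters too few records or because the class distributions in the two branches are not significantly different. The leaf is then left unsplit and reports the share of its majority class as its probability figure.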